Abstract

This  paper  describes  a  prototype  system  for 
near-real-time  spontaneous,  bidirectional 
translation  between  spoken  English  and 
Pashto,  a  language  presenting  many 
technological  challenges  because  of  its  lack  of 
resources,  including  both  data  and  expert 
knowledge.  Development  of  the  prototype  is 
ongoing,  and  we  propose  to  demonstrate  a 
fully  functional  version  which  shows  the  basic 
capabilities,  though  not  yet  their  final  depth 
and  breadth. 


1  Introduction 

This  demonstration  will  present  a  prototype  system  for 
bidirectional  speech-to-speech  translation  within  a 
limited  semantic  domain,  that  of  first  encounters 
between  a  patient  and  a  medical  professional.  A  major 
goal  of  the  work  is  to  explore  techniques  that  are 
appropriate  for  languages  that  are  not  of  great 
commercial  interest  and  that  consequently  are  lacking  in 
data  and  resources  of  many  kinds.  The  system  has  been 
developed  for  American  English  and  Pashto,  one  of  the 
major  languages  of  Afghanistan,  which  presents  a 
variety  of  challenges  for  both  data-intensive  and 
knowledge-based  approaches. 

It should be noted that the system must be viewed as
one component in a real-time transaction between two
cooperating humans. The ultimate goal of those humans
is to exchange information by whatever means is
effective: not necessarily to rely exclusively on the
system's output, but to use it in combination with other,
non-speech modalities of conveying meaning and with
ordinary world knowledge.

The  final  system  is  intended  to  run  on  a  handheld 
device,  such  as  a  PDA,  with  its  attendant  memory  and 
speed  limitations  and  restriction  to  integer-only 
computation.  Most  components  of  the  demonstration 
system  run  in  a  PocketPC  emulation  environment  on  a 
Windows  laptop,  with  a  few  in  the  full  Windows 
environment.  All  aspects  of  the  prototype  system  are 
undergoing  active  development. 

2  Overall  Architecture 

A  simple  description  of  the  architecture  is  as  follows. 
The  system  is  controlled  by  the  English  speaker,  who  is 
expected  to  have  greater  technological  familiarity  and 
who  has  the  benefit  of  visual  feedback  on  the  system 
performance.  Spoken  input,  in  either  English  or  Pashto, 
is recognized by SRI's small-footprint DynaSpeak®
recognizer, and an ordered list of hypotheses is
produced. The most likely hypothesis is input to SRI's
Gemini  natural  language  parser/generator  (Dowding  et 
al.  1993),  which  attempts  to  parse  the  speech 
recognition  output.  Handling  of  possible  errors  or 
failures  will  be  discussed  in  Section  3. 

When  a  successful  parse  is  obtained,  Gemini  creates 
a  quasi-logical  form  representing  the  meaning  of  the 
sentence.  In  general,  multiple  quasi-logical  forms  may 
be  created,  reflecting  differing  interpretations  of  the 
input sentence. These forms, which are domain-independent
and serve here as an interlingua, can be
ordered  by  heuristically  assigning  preferences  or 
dispreferences  to  the  parsing  rules  applied  to  create 
them.  Gemini  uses  a  grammar  for  the  target  language  to 
generate  a  translation  from  the  most-preferred 
interpretation  possible,  and  outputs  a  textual 
representation  of  the  translation. 
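
The preference ordering described above can be illustrated with a minimal sketch. It assumes that each interpretation records the names of the parse rules used to build it and that each rule carries a hand-assigned integer weight (negative for dispreferred rules). The rule names, the weights, and the `Interpretation` class are hypothetical and are not taken from Gemini.

```python
from dataclasses import dataclass

# Hypothetical heuristic weights; negative values mark dispreferred rules.
RULE_PREFERENCE = {
    "np_basic": 0,
    "vp_transitive": 0,
    "pp_attach_noun": 1,
    "pp_attach_verb": -1,
    "relaxed_agreement": -5,
}


@dataclass
class Interpretation:
    logical_form: str   # stand-in for a quasi-logical form
    rules_used: list    # names of the parse rules applied to build it


def preference_score(interp):
    """Sum the weights of all rules used; unknown rules count as neutral."""
    return sum(RULE_PREFERENCE.get(rule, 0) for rule in interp.rules_used)


def rank_interpretations(interps):
    """Return interpretations ordered from most to least preferred."""
    return sorted(interps, key=preference_score, reverse=True)
```

Under these assumptions, generation would start from the first element of the ranked list and move down the list only if no translation can be generated from it.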

Theta,  a  small-footprint  concatenative  synthesizer 
from  Cepstral  LLC  (Cepstral  LLC  2004),  then  produces 
synthetic  spoken  output  in  the  target  language  from  the 
textual  representation  generated  by  Gemini.  The  Pashto 
voice  was  created  specifically  for  this  project. 
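
The overall flow of one utterance through the system can be summarized in a short sketch. The four callables are placeholders for the recognizer, parser, generator, and synthesizer; their names and signatures are assumptions made for illustration and do not correspond to the actual DynaSpeak, Gemini, or Theta interfaces.

```python
def translate_utterance(audio, recognize, parse, generate, synthesize):
    """Run one utterance through recognition, parsing, generation, synthesis.

    All four stage functions are hypothetical stand-ins supplied by the caller.
    Returns the synthesized output, or None if a stage cannot proceed.
    """
    hypotheses = recognize(audio)          # ordered list, most likely first
    if not hypotheses:
        return None
    logical_form = parse(hypotheses[0])    # quasi-logical form, or None
    if logical_form is None:
        return None
    target_text = generate(logical_form)   # text in the target language
    return synthesize(target_text)         # spoken output
```

Because the stages are passed in as parameters, the same function can serve both translation directions by supplying the English or the Pashto version of each component.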

An  English  and  a  Pashto  version  of  each  component 
are  called  by  a  single  application  which  includes  a 
graphical  user  interface.  A  screen  shot  of  the  interface 
is shown in Figure 1.

Figure 1. Screen shot of prototype system
interface.

3  Redundancy  and  Handling  Infelicities 

As  in  any  complex  system,  performance  can  differ  from 
the  ideal  in  a  number  of  ways,  and  it  is  important  for  the 
system  to  provide  alternative  ways  to  support  successful 
communication.  Some  kinds  of  subideal  performance 
and  recovery  approaches  are  described  here. 

The most likely hypothesis output by the speech
recognizer may not be exactly what was spoken. When
the input speech is English, the English speaker can see
whether the most likely hypothesis (shown in the
"Input" box in Figure 1) is correct or approximately
correct. If the English speaker judges that the hypothesis
is not close enough to the intended text, s/he may either
repeat the utterance or click on the "Guesses..." button
to see an ordered list of the best hypotheses. A sample
list is shown in Figure 2. If the correct text is on this list,
the user may select it to submit for translation. When
the input speech is in Pashto, this functionality is less
useful, as the Pashto speaker is not assumed to be able
to read, even if the hypotheses were displayed in Pashto
script.


Figure  2.  Sample  ordered  list  of  recognizer 
hypotheses. 

Once  a  recognition  hypothesis  has  been  submitted 
for  translation,  several  possible  problems  can  arise. 
Pashto  is  a  moderately  inflected,  split-ergative  Indo- 
European  language,  and  for  Pashto  in  particular, 
recognition  errors  may  lead  to  apparent  lack  of  syntactic 
agreement  between  elements  of  the  sentence  which 
should  (and  did  in  fact)  agree.  As  Gemini  tries  to 
generate  a  full  parse  of  the  input,  it  has  the  option  of 
using  parse  rules  that  relax  agreement  requirements. 
These  rules  are  dispreferred  and  a  parse  built  upon  them 
may  be  kept  only  if  a  full,  grammatically  correct  parse 
cannot  be  completed.  Another  possible  problem  is  that 
unknown  words,  some  grammatical  constructions,  and 
input  errors  may  render  any  full  parse  unachievable.  In 
this  case,  fallback  strategies  can  be  applied  to  translate 
partial  parses,  fragments,  or  any  known  words.  Other 
strategies  are  currently  in  development. 
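
The order in which these recovery strategies apply can be sketched as a simple cascade. This is an illustration under stated assumptions, not the system's implementation: `strict_parse` and `relaxed_parse` are hypothetical callables that return a translation or None, the lexicon is a toy word list, and the intermediate fallbacks for partial parses and fragments are omitted for brevity.

```python
def translate_with_fallback(words, strict_parse, relaxed_parse, lexicon):
    """Return (strategy, translation) using the first strategy that succeeds.

    strict_parse / relaxed_parse: hypothetical callables, list of words ->
    translated string, or None when no full parse is found.
    lexicon: dict mapping known source words to target words.
    """
    # 1. Prefer a full, grammatically correct parse.
    result = strict_parse(words)
    if result is not None:
        return "full", result

    # 2. Otherwise accept a full parse built on relaxed-agreement rules.
    result = relaxed_parse(words)
    if result is not None:
        return "relaxed", result

    # 3. Last resort: translate whichever individual words are known.
    known = [lexicon[w] for w in words if w in lexicon]
    if known:
        return "words", " ".join(known)

    return "none", ""
```

Returning the strategy name alongside the translation would let an interface indicate to the English speaker how much confidence to place in the result.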

Another  class  of  approaches  for  assisting 
communication  allows  the  English  speaker  to  quickly 
perform  certain  actions  or  play  high-frequency  phrases 
to  the  Pashto  speaker.  If  there  is  background  noise  or 
distractions  or  the  TTS  quality  is  not  high  enough  for 
easy  comprehension,  pressing  the  "Replay"  button  will 
immediately  play  back  the  last  translation  result.  "Ask 
for  Rep"  plays  a  prerecorded  sentence  asking  the  Pashto 
speaker  to  repeat  what  s/he  said.  Several  other  useful 
phrases  are  available  by  clicking  on  the  "Phrases..." 
button  and  then  selecting  from  the  screen  shown  in 
Figure  3. 


Figure 3. Prerecorded Pashto phrases which can
be played back with a single click.


4 Sample Interaction

The table below shows an excerpt of a dialog between
an English and a Pashto speaker, both new to this
system, whose interaction was part of an informal trial
run by the MITRE Corporation in February 2004 for the
DARPA CAST program. The English speaker, a doctor,
had just received eighteen minutes of training in how to
use the system and had had no other exposure to it. The
Pashto speaker, playing the role of an injured patient,
had received training in complaints consistent with a
particular injury scenario and had seen others use the
system, but had not interacted with the system himself.
Except as noted, the translations are understandable.

Spoken input                     System result
-------------------------------  --------------------------------------------
I am a doctor, can I help you?   [failure to translate because two sentences]
I am a doctor                    z@ ddAkttar y@m
Can I help you?                  AyA z@ d@ tA sara komak kaw@lay S@m
ho, mehrabAni w@krra             yes make [partial translation; full
                                 translation should have been "yes please do"]
Where are you hurt?              [rerecorded after one misrecognition]
                                 t@ cherta khugx ye
ghwagx aw ugxa                   [misrecognized repeatedly; unable to give
                                 meaningful translation; should have been
                                 "ear and shoulder"]
Can you breathe?                 AyA t@ sA akhlst@lay Se
na, mUSkel lar@m                 no I have the problem
Do you have pain?                AyA t@ dard lare
zAyt                             of much [minor misrecognition; translation
                                 should have been "much"]
Do you take medications?         AyA t@ dawAuna akhle [incorrect plural form
                                 of dawA, but understood by Pashto speaker
                                 anyway]
na                               no
Do you want pain medication?     AyA t@ d@ dard dawA ghwArre
ho                               yes
Do you have allergies?           AyA t@ hasAsiyatuna lare
ho                               yes
What are your allergies?         St A hasAsiyat ts@ day
antibyutik                       [misrecognized but correctly understood by
                                 doctor as "antibiotics"]

5  Challenges 

Three main challenges face this project. First,
commercially  nonviable  languages,  such  as  Pashto, 
often  offer  very  few  linguistic  resources  (such  as 
linguistic  descriptions,  acoustic  data,  texts,  language 
processing  tools).  The  lack  of  resources  makes 
development  more  difficult,  and  severely  constrains  the 
approaches  that  are  viable:  approaches  that  rely  on  large 
corpora  cannot  be  used.  In  addition,  there  is  a  shortage 
of  literate  speakers  who  are  available  to  act  as 
consultants,  and  a  scarcity  of  basic  knowledge  about  the 
language.  This  impedes  progress  and  renders  difficult 
approaches  that  rely  on  a  large  body  of  hand-crafted 
translation  rules.  The  challenge  of  having  no  single 
person  who  has  a  deep  understanding  of  both  the 
language  and  the  technology  and  who  can  serve  as  a 
bridge  between  them  is  substantial,  and  causes  more 
iterative  development  than  is  ordinarily  the  case  when 
bilingual  technologists  are  available,  as  newly 
discovered  phenomena  or  new  understanding  cause 
revisions  of  previous  work. 

A  second  major  challenge  is  due  to  the  nature  of  the 
domain  and  the  underlying  concept  of  operations.  Real 
speech  occurs  in  noisy  environments,  has  disfluencies, 
and  is  highly  variable  (e.g.,  phrasings,  dialect 
differences).  In  addition,  the  output  of  a  speech 
recognizer  contains  more  and  qualitatively  different 
errors  than  typical  text  input  to  automatic  translation 
systems.  While  the  problems  of  real  speech  are  not 
unique  to  this  project,  they  are  magnified  by  the  fact 
that  the  non-English  speakers  will  largely  be 
unsophisticated  users  of  technology,  who  will  often  be 
using the system for the first and only time. The system
must work well from the very first utterance; there
cannot be much of a learning curve. This applies to the
translation  quality  and  to  other  aspects  of  the  system, 
such  as  the  synthetic  speech,  as  speakers  are  not  familiar 
with  synthesized  Pashto  speech.  These  speakers  are  also 
not  expected  to  be  literate,  and  their  understanding  will 
not  be  bolstered  by  the  extra  redundancy  and 
capabilities  that  the  display  offers  to  the  English 
speaker. 

A  third  major  challenge  is  posed  by  the  handheld 
device  platform  with  its  computational  and  storage 
limitations,  and  the  near-real-time  requirement  of  the 
envisioned  usage.  The  restriction  to  integer-only 
computation  is  most  serious  for  the  speech  recognition, 
as  nearly  all  medium-  or  large-vocabulary  speech 
recognizers  perform  extensive  floating-point 
computations,  and  the  conversion  of  a  speech  recognizer 
to  use  only  integer  computation  required  considerable 
effort. The severe memory limitations perhaps most affect
the parsing/generation components of the system.
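
To illustrate what integer-only computation involves, the sketch below shows fixed-point arithmetic, one common way of replacing floating-point operations. The Q15 format (15 fractional bits) and the function names are assumptions chosen for illustration; the actual number formats used in the recognizer are not described here.

```python
Q = 15            # assumed number of fractional bits (Q15 format)
ONE = 1 << Q      # fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to fixed point (done offline, not on the device)."""
    return int(round(x * ONE))


def fixed_mul(a, b):
    """Multiply two fixed-point values using integer operations only."""
    # The raw product carries 2*Q fractional bits; shift back down to Q.
    return (a * b) >> Q


def fixed_dot(xs, ys):
    """Dot product of two fixed-point vectors, rescaled once at the end."""
    return sum(x * y for x, y in zip(xs, ys)) >> Q
```

For example, `fixed_mul(to_fixed(0.5), to_fixed(0.25))` yields 4096, which is the Q15 representation of 0.125. On a real device the intermediate product must also fit the available integer width, which is part of why such conversions require care.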

6  Summary 

We  have  described  a  prototype  of  a  spoken  language 
translation  system  for  use  by  English-speaking  medical 
personnel  treating  Pashto-speaking  patients.  The 
system,  which  is  targeted  to  a  handheld  device,  is  being 
developed  with  extremely  little  Pashto-language  data  of 
any  kind.  It  builds  on  medium-vocabulary  speaker- 
independent  HMM-based  speech  recognition,  rule- 
based  translation  and  supporting  methods,  and 
concatenative  synthesis.  While  the  system  is  certainly 
still  under  development,  it  has  provided  a  reasonable 
proof  of  concept  in  informal  trials. 